source-mysql: Decode text correctly in non-UTF8 character sets #1979
Merged
This is a new test case which exercises `text` columns with various collations / character sets, storing test strings with interesting characters from several languages. I believe the current set of collations and test strings is enough to demonstrate the known issues with our current handling of these edge cases:
- A `latin1` collation stores strings in the Latin-1 character set, and we then cast those raw bytes to a string, which causes all of the non-ASCII characters to be replaced with U+FFFD.
- A `ucs2` collation stores strings in the UCS-2 double-byte character set, so for similar reasons all replicated values are badly mangled.
- A `binary` collation is captured as base64 bytes, because if you tell MySQL `TEXT COLLATE binary` it creates a `BLOB` column. This is not an error, but it seemed worth noting and including in the test here. The base64'd text appears to be a faithful base64 representation of the input as a UTF-8 string.
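A minimal sketch of what such a test setup might look like (the table name, column names, and test string here are illustrative assumptions, not the actual test code):

```go
package charsettest

import "database/sql"

// createCharsetTestTable creates a hypothetical table with text columns in
// several collations, populated with a string containing non-ASCII characters.
func createCharsetTestTable(db *sql.DB) error {
	// MySQL silently turns `TEXT COLLATE binary` into a BLOB column.
	_, err := db.Exec(`CREATE TABLE test.charsets (
		id         INTEGER PRIMARY KEY,
		txt_latin1 TEXT CHARACTER SET latin1,
		txt_ucs2   TEXT CHARACTER SET ucs2,
		txt_binary TEXT COLLATE binary
	)`)
	if err != nil {
		return err
	}
	_, err = db.Exec(`INSERT INTO test.charsets VALUES (1, 'élan café', 'élan café', 'élan café')`)
	return err
}
```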
Modifies the column-discovery and primary-key-discovery queries to apply the same "not in a system schema" filtering that the table discovery query has, and then modifies column discovery so that the raw information is logged for each column.
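For illustration, the system-schema filter on column discovery presumably looks something like the following (the exact query text is an assumption, not the connector's actual query):

```go
// Hypothetical sketch of column discovery with the same system-schema
// filtering applied to information_schema.columns; each discovered row is
// then logged so the raw column information is visible in the task logs.
const queryDiscoverColumns = `
  SELECT table_schema, table_name, ordinal_position, column_name,
         is_nullable, data_type, column_type, collation_name
    FROM information_schema.columns
   WHERE table_schema NOT IN ('information_schema', 'mysql',
                              'performance_schema', 'sys')
   ORDER BY table_schema, table_name, ordinal_position;`
```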
This commit adds most of the necessary plumbing to keep track of the character set of a text column and apply a charset-aware decode function instead of just casting the bytes to a string. However, it does not actually implement the proper decoding yet: it still does `var str = string(bs)` at the appropriate spot, with a TODO noting that this is wrong. The next commit will implement proper decoding.
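The shape of that plumbing is roughly as follows (type and function names here are hypothetical; the point is that the charset travels with the column type down to the decode site):

```go
package capture

// mysqlTextColumnType is a hypothetical sketch of per-column metadata that
// now carries the column's character set.
type mysqlTextColumnType struct {
	Charset string // e.g. "latin1", "ucs2", "utf8mb4"
}

// decodeTextColumn receives the charset but, at this point in the history,
// still performs the old naive cast.
func decodeTextColumn(bs []byte, charset string) (string, error) {
	// TODO: decode according to the column charset instead of assuming UTF-8.
	var str = string(bs)
	return str, nil
}
```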
Currently this doesn't work, and I've stubbed out the logic with a hard-coded default of `utf8mb4`, but now there's a test case which will show the corrected values once I fix that.
This works just fine, which is to be expected because we haven't changed anything and we already knew that latin-1 character set text columns work fine under backfills.
This more or less preserves the old behavior for any charsets we haven't thought to add to the decoders map, but logs an error so I can check for it in a few days and add any others we might be missing.
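Roughly, that fallback looks something like this (the map contents, function name, and the use of golang.org/x/text and logrus are illustrative assumptions rather than the connector's exact code):

```go
package capture

import (
	"github.com/sirupsen/logrus"
	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/charmap"
	"golang.org/x/text/encoding/unicode"
)

// Hypothetical charset -> decoder table. MySQL's "latin1" is the cp1252
// superset of ISO 8859-1, and its "ucs2" is big-endian two-byte characters.
var textDecoders = map[string]encoding.Encoding{
	"utf8":    unicode.UTF8,
	"utf8mb4": unicode.UTF8,
	"latin1":  charmap.Windows1252,
	"ucs2":    unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM),
}

func decodeTextColumn(bs []byte, charset string) (string, error) {
	enc, ok := textDecoders[charset]
	if !ok {
		// Unknown charset: log an error but keep the old cast-to-string
		// behavior rather than failing the capture.
		logrus.WithField("charset", charset).Error("unknown column charset")
		return string(bs), nil
	}
	return enc.NewDecoder().String(string(bs))
}
```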
And use that collation when processing DDL alterations which don't explicitly specify another collation or charset.
In this case we want to apply the same "charset from collation" mapping function that we use during discovery. The hierarchy of column charsets now goes:
1. Explicit CHARSET declaration
2. Explicit COLLATE declaration
3. Default for the table (which omits utf8mb4 in some cases)
4. Default to utf8mb4 as the last resort
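A sketch of that precedence (function and parameter names are assumptions, and the prefix-based collation mapping is only an illustration):

```go
package capture

import "strings"

// resolveColumnCharset is a hypothetical sketch of the precedence used to
// resolve the charset of a column added via ALTER TABLE.
func resolveColumnCharset(explicitCharset, explicitCollation, tableDefaultCharset string) string {
	switch {
	case explicitCharset != "":
		return explicitCharset // 1. explicit CHARSET declaration
	case explicitCollation != "":
		return charsetFromCollation(explicitCollation) // 2. explicit COLLATE declaration
	case tableDefaultCharset != "":
		return tableDefaultCharset // 3. default charset of the table
	default:
		return "utf8mb4" // 4. last-resort default
	}
}

// charsetFromCollation maps a collation name such as "latin1_swedish_ci" to
// its character set ("latin1"). MySQL collation names are prefixed with their
// charset, so a simple prefix rule serves for illustration here.
func charsetFromCollation(collation string) string {
	if idx := strings.Index(collation, "_"); idx > 0 {
		return collation[:idx]
	}
	return collation
}
```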
jgraettinger approved these changes on Sep 27, 2024
LGTM
It was always the plan to include the table charset in the table metadata unconditionally, I just left that part for a followup commit to keep the diffs separate. This removes the `"utf8mb4" -> ""` omission from the replication code so that it's always made explicit what charset a table is using in metadata initialized after this change. The default of `"" -> "utf8mb4"` still exists in the DDL alteration datatype translation so that will always be explicit for newly added columns, and likewise the string decoding function defaults `"" -> "utf8mb4"` so that old metadata works, but after this change we always specify charset information explicitly when generating new metadata.
willdonnelly force-pushed the wgd/debugging-20240916 branch from e0017f0 to 391d7c7 on September 30, 2024.
Description:
This PR fixes `source-mysql` to correctly decode text columns with non-UTF8 character sets. This fixes #1951.

Previously, backfilled values were correct, because MySQL always sends us those in UTF-8, but replicated values were occasionally incorrect (or at least sub-optimal), because the bytes stored in the binlog are in whatever character set the column uses, and at a certain point we just cast `[]byte -> string`, which results in any non-ASCII bytes getting replaced with U+FFFD REPLACEMENT CHARACTER.

The fix is:
- Keep track of the character set of each text column, both at discovery time and when processing any `ALTER TABLE` query occurring after the table backfill which adds a new column (where the character set could be explicit, could be implied by an explicit collation, or could be the table default).
- Apply a charset-aware `[]byte -> string` decode function to text column bytes received via replication, based on the character set of the column. This part is reasonably straightforward, but required adding an extra parameter to the value translation code to distinguish between backfilled values (which are always UTF-8) and replicated ones (where we need to do all this fanciness); a sketch follows below.

I have opted to make the connector default to a UTF-8 decoder when faced with an unknown or unrecognized charset rather than failing the capture. This would probably be the wrong choice if the connector were new, but right now we don't know whether there are any captures in production using some other character set which I overlooked, so the safest behavior is to keep doing the same thing as before this PR (decoding the bytes by casting to a UTF-8 string directly) while logging an error. Later on I can check whether that error actually occurred anywhere in production, add support for the offending character sets, and consider making "unknown charset" a fatal error at that time.
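A minimal sketch of that extra parameter (the function name and signature are assumptions, reusing the `decodeTextColumn` sketch from earlier):

```go
package capture

// translateTextValue is a hypothetical sketch of charset-aware value
// translation: backfilled values arrive from MySQL already encoded as UTF-8
// and pass through unchanged, while replicated values are decoded according
// to the column charset.
func translateTextValue(bs []byte, charset string, fromReplication bool) (string, error) {
	if !fromReplication {
		return string(bs), nil // backfill query results are always UTF-8
	}
	if charset == "" {
		charset = "utf8mb4" // metadata written before this change has no charset
	}
	return decodeTextColumn(bs, charset)
}
```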
Workflow steps:
Most captures don't need to care or do anything. Captures will continue operating as normal for preexisting bindings (because their metadata for all tables and columns has no charset information, which means an implicit default of UTF-8). New captures, or new bindings or backfills on existing captures, will reinitialize that metadata and pick up the appropriate charset information. In most cases this will have no impact because the charset really is UTF-8.
Captures of latin-1 or other non-Unicode text columns which actually contain non-ASCII code points, and which are currently being captured as U+FFFD REPLACEMENT CHARACTER, can be fixed to capture the correct values by re-backfilling the binding.